Recently, the problem of obtaining a short regular expression equivalent to a given finite automaton
has been intensively investigated. Algorithms for converting finite automata to regular expressions
have an exponential blow-up in the worst case. To overcome this, simple heuristic methods have been
proposed. In this paper we analyse some of the heuristics presented in the literature and propose new
ones. We also present some experimental comparative results based on uniformly generated random
deterministic finite automata.



1 Introduction 

Recently, the problem of obtaining a short regular expression equivalent to a given finite automaton has
been intensively investigated. An extensive survey was presented by Ellul et al. [EKSW05], and more
recently by Gruber and Holzer [GH08b]. It is well known that the problem of obtaining a minimal regular
expression is PSPACE-complete, and NP-complete for acyclic automata [JR93]. It is also inefficient to
approximate a minimal regular expression [GS07], unless P=PSPACE. Classic algorithms for converting
finite automata to regular expressions can produce regular expressions of size $O(nk4^n)$ in the worst
case, where $n$ is the number of states and $k$ the alphabet size of the corresponding automaton. Several
exponential lower bounds are provided in the literature [EKSW05, GH08a], showing that the exponential
blow-up is unavoidable. For specific classes of automata, better upper bounds can be found [EKSW05,
GF08, Sak05, MR09]. In particular, Gruber and Holzer [GH08b] presented an algorithm that converts
an $n$-state deterministic finite automaton (DFA) over a binary alphabet into a regular expression of size
at most $O(1.742^n)$. In general, the order in which the automaton's states are considered in the
conversion is essential to obtaining shorter regular expressions. To tackle the problem of obtaining an
optimal ordering in a feasible manner, heuristic methods have been proposed [DM04, HW07, AH09].

In this paper we analyse some of the heuristics presented in the literature and propose new ones. To
test their performance, experiments were carried out using statistically significant samples
obtained with a uniform random generator. The paper is organized as follows. In the next section some
basic notions are reviewed. Section 3 summarizes the conversions from finite automata to regular
expressions, and in particular the state elimination method. Section 4 describes some elimination ordering
strategies and proposes two new ones. In Section 5 experimental results are analysed and Section 6
concludes.

2 Preliminaries



We recall some basic notions of digraphs, finite automata and regular expressions. For more details we
refer the reader to standard books [HMU00, Sak09, Har69].

A digraph $D = (V,E)$ consists of a finite set $V$ of vertices and a set $E$ of ordered pairs of vertices,
called arcs. If $(u,v) \in E$, $u$ is adjacent to (or incident to) $v$ and $v$ is adjacent from $u$. For each vertex $v$, the
indegree of $v$ is the number $m$ of vertices adjacent to it and the outdegree of $v$ is the number $n$ of vertices
adjacent from it, and we write $v(m;n)$. An arc $(u,v)$ can be denoted by $uv$. A path between $v_0$ and $v_n$
is a sequence $v_0v_1, v_1v_2, \ldots, v_{n-1}v_n$ of arcs, and is denoted by $v_0 \cdots v_n$, or $v_0 \cdots v_k \cdots v_n$, for $1 < k < n$. A
path is simple if all the vertices in it are distinct. The length of a path is the number of arcs in the path.
A path is a cycle if $v_0 = v_n$ and $n > 1$. A digraph that has no cycles is called acyclic.

We now review some notions and notation from formal languages and finite automata. Let $\Sigma$ be a
finite alphabet and $\Sigma^*$ be the set of words over $\Sigma$. The empty word is denoted by $\varepsilon$. A language over
$\Sigma$ is a subset of $\Sigma^*$. A regular expression (r.e.) $\alpha$ over $\Sigma$ represents a regular language $\mathcal{L}(\alpha) \subseteq \Sigma^*$ and
is inductively defined by: $\emptyset$ is a r.e. and $\mathcal{L}(\emptyset) = \emptyset$; $\varepsilon$ is a r.e. and $\mathcal{L}(\varepsilon) = \{\varepsilon\}$; $a \in \Sigma$ is a r.e. and
$\mathcal{L}(a) = \{a\}$; if $\alpha_1$ and $\alpha_2$ are r.e., $(\alpha_1 + \alpha_2)$, $(\alpha_1\alpha_2)$, and $(\alpha_1)^*$ are r.e., respectively with
$\mathcal{L}((\alpha_1 + \alpha_2)) = \mathcal{L}(\alpha_1) \cup \mathcal{L}(\alpha_2)$, $\mathcal{L}((\alpha_1\alpha_2)) = \mathcal{L}(\alpha_1)\mathcal{L}(\alpha_2)$, and $\mathcal{L}((\alpha_1)^*) = \mathcal{L}(\alpha_1)^*$. The alphabetic size
of an r.e. $\alpha$ is the number of alphabetic symbols of $\alpha$ and is denoted by $|\alpha|_\Sigma$. Let $R$ be the set of regular
expressions over $\Sigma$. Two regular expressions $\alpha$ and $\beta$ are equivalent if $\mathcal{L}(\alpha) = \mathcal{L}(\beta)$, and we write
$\alpha = \beta$. With this interpretation, the algebraic structure $(R,+,\cdot,\emptyset,\varepsilon)$ constitutes an idempotent semiring,
and with the unary operator $*$, a Kleene algebra. Using these algebraic properties as (simplification)
rewrite rules, it is possible to decide if two regular expressions are equivalent, but no algorithm is known
to minimize a given regular expression (except a brute-force one).
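
As an illustration of the inductive definition, the alphabetic size can be computed by a direct recursion on the structure of an r.e. The following is only a minimal sketch: the tuple encoding of expressions and the name `alphabetic_size` are assumptions made for this example and do not come from the cited literature.

```python
# Regular expressions as nested tuples (an encoding assumed for this sketch):
#   ("empty",)  ("eps",)  ("sym", "a")
#   ("plus", r1, r2)  ("cat", r1, r2)  ("star", r)

def alphabetic_size(r):
    """Return the number of alphabetic symbols occurring in the expression r."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return 0            # the empty set and the empty word carry no symbols
    if tag == "sym":
        return 1
    if tag in ("plus", "cat"):
        return alphabetic_size(r[1]) + alphabetic_size(r[2])
    if tag == "star":
        return alphabetic_size(r[1])
    raise ValueError(f"unknown node: {tag!r}")

# (a + b)* a  contains three alphabetic symbols
example = ("cat", ("star", ("plus", ("sym", "a"), ("sym", "b"))), ("sym", "a"))
print(alphabetic_size(example))  # 3
```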

A non-deterministic finite automaton (NFA) $\mathcal{A}$ is a quintuple $(Q,\Sigma,\delta,q_0,F)$ where $Q$ is a finite set
of states, $\Sigma$ is the alphabet, $\delta \subseteq Q \times \Sigma \times Q$ is the transition relation, $q_0$ the initial state and $F \subseteq Q$ is the
set of final states. For $q \in Q$ and $a \in \Sigma$, we denote the set $\{p \in Q \mid (q,a,p) \in \delta\}$ by $\delta(q,a)$, and we can
extend this notation to $w \in \Sigma^*$, and to $R \subseteq Q$. The language recognized by $\mathcal{A}$ is $\mathcal{L}(\mathcal{A}) = \{w \in \Sigma^* \mid
\delta(q_0,w) \cap F \neq \emptyset\}$. An NFA is deterministic (DFA) if for each pair $(q,a) \in Q \times \Sigma$, $|\delta(q,a)| \le 1$. A DFA is
complete if $\delta$ is a total function. An NFA is initially-connected if for each state $q \in Q$ there exists a word
$w \in \Sigma^*$ such that $q \in \delta(q_0,w)$. A complete initially-connected DFA is denoted by ICDFA. An NFA is
trim if it is initially-connected and if every state is useful, i.e., for all $q \in Q$ there exists a word $w \in \Sigma^*$ such
that $F \cap \delta(q,w) \neq \emptyset$. The underlying digraph of an NFA $\mathcal{A} = (Q,\Sigma,\delta,q_0,F)$ is the digraph $D = (Q,E)$
such that $E = \{(q,q') \mid q,q' \in Q$ and $\exists a \in \Sigma \cup \{\varepsilon\}$ such that $(q,a,q') \in \delta\}$. Note that even if there can be
more than one symbol of $\Sigma$ between two states $q$ and $q'$, only one arc exists in the underlying digraph.
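
The two graph-related notions above can be made concrete with a short sketch. It assumes, purely for illustration, that the transition relation is stored as a set of triples `(q, a, p)`; the function names and the sample NFA are hypothetical.

```python
from collections import deque

def underlying_digraph(delta):
    """Arcs (q, p) such that some transition (q, a, p) exists.

    Several symbols between the same two states collapse into one arc.
    """
    return {(q, p) for (q, _a, p) in delta}

def is_initially_connected(states, delta, q0):
    """True if every state is reachable from the initial state q0."""
    arcs = underlying_digraph(delta)
    seen, queue = {q0}, deque([q0])
    while queue:
        u = queue.popleft()
        for (x, y) in arcs:
            if x == u and y not in seen:
                seen.add(y)
                queue.append(y)
    return seen == set(states)

# Hypothetical NFA: the two symbols between states 0 and 1 yield a single arc.
delta = {(0, "a", 1), (0, "b", 1), (1, "a", 1), (1, "b", 2)}
print(sorted(underlying_digraph(delta)))             # [(0, 1), (1, 1), (1, 2)]
print(is_initially_connected({0, 1, 2}, delta, 0))   # True
```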

For the conversion from NFAs to r.e.'s, extended finite automata are considered. An extended finite
automaton (EFA) $\mathcal{A}$ is a quintuple $(Q,\Sigma,\delta,q_0,F)$, where $Q$, $\Sigma$, $q_0$ and $F$ are as before, and $\delta : Q \times Q \to R$.
We assume that $\delta(q,q') = \emptyset$, if the transition from $q$ to $q'$ is not defined. Any NFA can be easily
transformed into an equivalent EFA, with the same underlying digraph: for each pair of states $(q,q')$ one
needs to construct a regular expression $a_1 + \cdots + a_n$ such that $(q,a_i,q') \in \delta$, $a_i \in \Sigma \cup \{\varepsilon\}$, $1 \le i \le n$.
This transformation corresponds to eliminating parallel transitions. Whenever appropriate we will use
the same terminology both for digraphs and for automata.
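
The NFA-to-EFA transformation amounts to grouping transitions by their pair of end states. The sketch below assumes a set-of-triples encoding of the NFA and string labels for the EFA; both choices, and the name `nfa_to_efa`, are illustrative assumptions.

```python
def nfa_to_efa(delta):
    """Merge parallel transitions (q, a, p) into one label a1 + ... + an per pair (q, p).

    Pairs without any transition are simply absent, which stands for the empty set.
    """
    symbols = {}
    for q, a, p in sorted(delta):          # sorted only to make the labels deterministic
        symbols.setdefault((q, p), []).append(a)
    return {pair: " + ".join(syms) for pair, syms in symbols.items()}

# Hypothetical NFA with two parallel transitions from state 0 to state 1.
efa = nfa_to_efa({(0, "a", 1), (0, "b", 1), (1, "a", 1)})
print(efa)  # {(0, 1): 'a + b', (1, 1): 'a'}
```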

3 From Finite Automata to Regular Expressions

Kleene's theorem [Kle56], which establishes the equivalence between the languages accepted by finite
automata and those represented by regular expressions, provided a proof that a language accepted by an
NFA can be represented by a r.e. McNaughton and Yamada [MY60] presented a recursive algorithm that
calculates a r.e. from an NFA based on the computation of the transitive closure of the underlying digraph.
Brzozowski and McCluskey [BJ63] introduced a method, now known as the state elimination algorithm (SEA),
that considers EFAs and leads, in general, to simpler computations and shorter r.e.'s. A third method
exists, based on solving a system of linear equations in a process akin to Gaussian elimination [Ard60, Koz94].
This last approach is interesting as linear algebra or optimization techniques can be adapted in order to
provide new methods to obtain r.e.'s. Sakarovitch [Sak05, Sak09] studied the relationship between the
three methods and in particular showed that, given an order on the set of states $Q$, the regular expressions
obtained by two different methods can be reduced to each other by the application of a specific subset of
algebraic properties.

Most improvements and heuristic methods are based on the state elimination method and try to 
identify state orderings that lead to shorter r.e.'s. 

3.1 State Elimination Method Revisited 

The state elimination algorithm takes as input an EFA and produces an equivalent r.e. In each step, a
non-initial and non-final state of the EFA is eliminated (deleted) and the transitions are changed in such
a way that the new and the old EFAs are equivalent. Usually it is assumed that the input EFA is trim and
normalized, i.e., the initial state has no incoming transitions, there is only one final state and that state has
no outgoing transitions. Every EFA (or NFA) can be transformed into an equivalent normalized EFA.
Formally, let $\mathcal{A} = (Q,\Sigma,\delta,q_0,F)$ be an EFA, then:

Normalization:

(NI) If there is $q \in Q$ such that $\delta(q,q_0) \neq \emptyset$, then add a new state $i$ to $Q$, define $\delta(i,q_0) = \varepsilon$, and
set $i$ as the new initial state.

(NII) If $|F| > 1$ or there exist $q \in F$ and $q' \in Q$ such that $\delta(q,q') \neq \emptyset$, then add a new state $f$ to $Q$ and
a transition $\delta(q,f) = \varepsilon$, for all $q \in F$. The set of final states becomes $\{f\}$.

Without loss of generality, let $\mathcal{A}' = (Q',\Sigma,\delta',i,f)$ denote the new normalized EFA. Let $\alpha_{qq'}$ denote
the regular expression $\delta'(q,q')$. Normalization is preserved when the state elimination process below is
performed.

State Elimination:

(EI) If $Q' = \{i,f\}$, then the resulting regular expression is $\alpha_{if}$, and the algorithm terminates.
Otherwise continue to step (EII).

(EII) Choose $q \in Q' \setminus \{i,f\}$. Eliminate $q$ from $\mathcal{A}'$, considering $Q' \setminus \{q\}$ the new set of states, and
for each $q_1,q_2 \in Q' \setminus \{q\}$,

$$\delta'(q_1,q_2) = \alpha_{q_1q_2} + \alpha_{q_1q}\,\alpha_{qq}^{*}\,\alpha_{qq_2}.$$

Continue to step (EI).
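
The elimination step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the EFA is a dictionary from state pairs to label strings (a missing pair stands for the empty set), the EFA is already normalized with states named `"i"` and `"f"`, and the only simplifications applied are dropping ε in concatenations and skipping empty-set terms. The helper names are hypothetical and this is not the FAdo implementation.

```python
EPS = "ε"  # label of an ε-transition; a missing key stands for the empty set

def cat(x, y):
    """Concatenate two labels, dropping ε and (conservatively) parenthesizing sums."""
    if x == EPS:
        return y
    if y == EPS:
        return x
    wrap = lambda s: f"({s})" if "+" in s else s
    return wrap(x) + wrap(y)

def star(x):
    """Kleene star of a label."""
    return f"{x}*" if len(x) == 1 else f"({x})*"

def eliminate(delta, q):
    """One step (EII): remove q and reroute every path p -> q -> r."""
    loop = star(delta[(q, q)]) if (q, q) in delta else EPS
    ins = [(p, a) for (p, r), a in delta.items() if r == q and p != q]
    outs = [(r, b) for (p, r), b in delta.items() if p == q and r != q]
    new = {pair: label for pair, label in delta.items() if q not in pair}
    for p, a in ins:
        for r, b in outs:
            through = cat(cat(a, loop), b)
            new[(p, r)] = f"{new[(p, r)]}+{through}" if (p, r) in new else through
    return new

def state_elimination(delta, order, i="i", f="f"):
    """Eliminate the states in the given order; return the label from i to f."""
    for q in order:
        delta = eliminate(delta, q)
    return delta.get((i, f))  # None stands for the empty set

# Hypothetical normalized EFA for the language a*b.
efa = {("i", 0): EPS, (0, 0): "a", (0, 1): "b", (1, "f"): EPS}
print(state_elimination(efa, [0, 1]))  # a*b
```

Because only the `order` argument changes between strategies, the same skeleton can be reused to compare different elimination orderings.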

Hopcroft et al. [HMU00] presented a slight variation of the above algorithm that omits the normalization
step. Considering that there is only one final state, state elimination ends with one of the following
EFAs (where some r.e.'s can be $\emptyset$):

[Figure: the two possible final EFAs. Left: the initial state is final. Right: there are two different states.]

In the left case, the final regular expression is $\beta^*$ and in the right case, the final regular expression can be
$\beta_1^*\alpha_1(\beta_2 + \alpha_2\beta_1^*\alpha_1)^*$ or any shorter r.e. if some of the transitions are labelled by $\emptyset$. When $|F| > 1$ the
normalization step (NII) should be considered. We refer, by abuse of language, to this algorithm as the
SEA without normalization (SEAwn). It has the advantage of avoiding unnecessary $\varepsilon$-transitions, and, as
we will see in Section 4.3, it exhibits a better performance for the elimination strategies.

4 State Elimination Orderings 

The importance of the order in which the states are considered in the conversion was noticed by the
authors of the early algorithms. McNaughton and Yamada suggested that states with higher in- and
outdegrees should be considered at the end. Brzozowski and McCluskey proposed to eliminate first the
states $q \in Q$ such that $q(1;1)$, i.e., $q$ connects two other states in series:

Acyclic NFAs for which in each step of the state elimination process there is a state satisfying these
conditions were studied by Moreira and Reis [MR09] and called SP-automata. For this class it is possible
to obtain a linear size r.e. in $O(n^2 \log n)$ time. If an acyclic NFA is not SP, it must be reduced by
series-parallel elimination to one that contains a subgraph of the form:




And, in general, it is not easy to see which elimination ordering should be considered. 

The SP-automata strategy was extended by Gulan and Fernau [GF08] for a specific case of cyclic
NFAs. SP-automata belong to the class of graphs which excludes a complete graph as a minor. For
this class, Ellul et al. proved that there are r.e.'s whose size is less than $e^{O(\sqrt{n})}$. Gruber and Holzer
extended this work to DFAs, providing an algorithm with a guaranteed performance of $O(1.742^n)$ for
binary alphabets.

4.1 Delgado and Morais Heuristics 

In each step of the state elimination process, given $q(m;l)$, the contribution of this state to the size of
the final regular expression can be measured by

$$W(q) = (l-1)\sum_{i=1}^{m} |\alpha_{q_iq}| + (m-1)\sum_{j=1}^{l} |\alpha_{qq_j}| + (ml-1)\,|\alpha_{qq}|. \qquad (1)$$

Delgado and Morais [DM04] proposed a strategy (DM) that in each step eliminates a state $q$ with the
lowest weight $W(q)$. Although this heuristic is quite simple and runs in $O(n^2)$, the experimental results
provide evidence that it has very good performance. Recently, Gruber et al. [GHT09] presented more
experimental results, which showed statistical significance and were based on uniformly generated random
ICDFAs, where this heuristic almost always outperforms several others. Our results corroborate this good
performance. In particular, when applied to an SP-automaton, this heuristic always selects a state $q$ such
that $q(1;1)$, producing a linear size r.e.
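
The weight (1) can be read as the change in the total alphabetic size of the EFA labels caused by eliminating $q$, when no simplification is applied. The sketch below checks this on a small example. It assumes that the EFA is reduced to a dictionary of label sizes, that $m$ and $l$ count neighbours other than $q$ itself, and that a missing loop has size 0; the function names and the sample data are hypothetical.

```python
def weight(sizes, q):
    """W(q) from equation (1); sizes maps (p, r) to the alphabetic size of the label."""
    ins = [s for (p, r), s in sizes.items() if r == q and p != q]
    outs = [s for (p, r), s in sizes.items() if p == q and r != q]
    loop = sizes.get((q, q), 0)
    m, l = len(ins), len(outs)
    return (l - 1) * sum(ins) + (m - 1) * sum(outs) + (m * l - 1) * loop

def eliminate(sizes, q):
    """Size-only version of one elimination step (no algebraic simplification)."""
    loop = sizes.get((q, q), 0)
    ins = [(p, s) for (p, r), s in sizes.items() if r == q and p != q]
    outs = [(r, s) for (p, r), s in sizes.items() if p == q and r != q]
    new = {pair: s for pair, s in sizes.items() if q not in pair}
    for p, a in ins:
        for r, b in outs:
            new[(p, r)] = new.get((p, r), 0) + a + loop + b
    return new

def dm_choice(sizes, candidates):
    """DM strategy: pick a candidate state of lowest weight."""
    return min(candidates, key=lambda q: weight(sizes, q))

# Hypothetical EFA with initial state "i", final state "f" and inner states 1, 2.
sizes = {("i", 1): 1, (1, 1): 1, (1, 2): 1, (2, 1): 1, (2, "f"): 1, (1, "f"): 2}
before = sum(sizes.values())
after = sum(eliminate(sizes, 1).values())
print(weight(sizes, 1), after - before)   # 8 8
print(dm_choice(sizes, [1, 2]))           # 2
```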

4.2 Han and Wood Heuristics 

Han and Wood [HW07] introduced the notion of bridge state, which leads to a decomposition of the EFA,
and therefore of the elimination process. That notion was redefined by Ahn and Han [AH09], as follows: a
state $q$ is a bridge state if it satisfies the following conditions:

(BI) $q$ is neither initial nor final;

(BII) For any $f \in F$, each path $i \cdots f$ must pass through $q$, i.e., must be of the form $i \cdots q \cdots f$, where $i$
is the initial state;

(BIII) $q$ does not participate in any cycle except for a loop.

Note that bridge states correspond to the usual notion of cut points, with the extra constraint (BIII).
Bridge states can be found in linear time, and it was proved that in an optimal elimination ordering the
bridge states must be the last ones. This is easy to see because the automaton can be decomposed into
two subautomata $\mathcal{A}_1$ and $\mathcal{A}_2$, such that a bridge state $q$ corresponds to the final state of $\mathcal{A}_1$ and the initial
state of $\mathcal{A}_2$:




Ahn and Han present some empirical results of this strategy (which we designate by HW) combined with
the one based on state weights (DM) and also with one that performs a parallel decomposition of the
EFA. Although the dataset used is randomly generated, it is neither uniform nor statistically significant.
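
The three bridge-state conditions can be tested directly on the underlying digraph. The sketch below is a brute-force check (one graph search per state), not the linear-time procedure mentioned above; it assumes a trim automaton given by its set of arcs, and all names and sample data are hypothetical.

```python
def reachable(succ, start, removed=None):
    """Vertices reachable from start in the digraph with vertex `removed` deleted."""
    if start == removed:
        return set()
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in succ.get(u, ()):
            if v != removed and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def bridge_states(states, arcs, initial, finals):
    succ = {}
    for u, v in arcs:
        succ.setdefault(u, set()).add(v)
    result = []
    for q in states:
        if q == initial or q in finals:                     # (BI)
            continue
        if reachable(succ, initial, removed=q) & finals:    # (BII) a path avoids q
            continue
        # (BIII): q lies on a cycle other than a loop iff it can be reached
        # again from one of its successors different from q itself
        if any(q in reachable(succ, s) for s in succ.get(q, ()) if s != q):
            continue
        result.append(q)
    return sorted(result)

# Hypothetical digraph: initial state 0, final state 3, a loop on state 2.
arcs = {(0, 1), (0, 4), (4, 1), (1, 2), (2, 2), (2, 3)}
print(bridge_states(range(5), arcs, 0, {3}))                     # [1, 2]
# Adding the cycle 1 -> 5 -> 1 disqualifies state 1 by (BIII).
print(bridge_states(range(6), arcs | {(1, 5), (5, 1)}, 0, {3}))  # [2]
```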

4.3 SEA Without Normalization 

Consider the following simple DFA: 

[Figure: a four-state DFA over the alphabet {a, b}.]


Applying the SEA with normalization to this DFA and using the DM strategy, the first state to be eliminated
corresponds to the initial state (i.e., it is the one with the smallest weight). This will lead to a r.e. with the highest
alphabetic size (29), within all that can be obtained by state elimination. The elimination ordering is 0,
3, 1, 2.

On the other hand, if we consider a SEA with the Hopcroft et al. approach (such that the initial state is
only considered at the end), applying the DM strategy will lead to a r.e. with the smallest alphabetic size
(12). Now, the elimination ordering is 2, 1 (as the two other states are fixed). This strategy corresponds
to combining the DM strategy with one where the initial state is the last to be eliminated. Our experimental
results below show that this approach (SEAwn) improves, in general, the strategies we considered.
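
The two alphabetic sizes quoted for this example can be recomputed with a size-only version of state elimination. The sketch below rests on several assumptions that should be read as such: the DFA is taken from its canonical string 12312312 given in Section 5, state 3 is assumed to be the only final state (consistent with the orderings above), normalization is assumed to add a new initial and a new final state unconditionally, and no simplification is applied other than ε having size 0 and empty-set terms being dropped. In the unnormalized case the last step uses the standard expression for a two-state automaton. All function names are hypothetical.

```python
def eliminate(sizes, q):
    """Size-only elimination step; a missing pair stands for the empty set."""
    loop = sizes.get((q, q), 0)
    ins = [(p, s) for (p, r), s in sizes.items() if r == q and p != q]
    outs = [(r, s) for (p, r), s in sizes.items() if p == q and r != q]
    new = {pair: s for pair, s in sizes.items() if q not in pair}
    for p, a in ins:
        for r, b in outs:
            new[(p, r)] = new.get((p, r), 0) + a + loop + b
    return new

def size_with_normalization(sizes, initial, final, order):
    sizes = dict(sizes)
    sizes[("i", initial)] = 0      # ε-transition from the new initial state
    sizes[(final, "f")] = 0        # ε-transition to the new final state
    for q in order:
        sizes = eliminate(sizes, q)
    return sizes.get(("i", "f"), 0)

def size_without_normalization(sizes, initial, final, order):
    for q in order:
        sizes = eliminate(sizes, q)
    b1 = sizes.get((initial, initial), 0)   # loop on the initial state
    a1 = sizes.get((initial, final))        # initial -> final
    b2 = sizes.get((final, final), 0)       # loop on the final state
    a2 = sizes.get((final, initial))        # final -> initial
    if a1 is None:
        return 0                            # the language is empty
    inner = b2 if a2 is None else b2 + a2 + b1 + a1
    return b1 + a1 + inner

# Decode the canonical string: state q reaches canonical[q*k + a] on symbol a.
canonical, n, k = "12312312", 4, 2
dfa = {}
for q in range(n):
    for a in range(k):
        p = int(canonical[q * k + a])
        dfa[(q, p)] = dfa.get((q, p), 0) + 1   # each symbol has size 1

print(size_with_normalization(dfa, 0, 3, [0, 3, 1, 2]))   # 29
print(size_without_normalization(dfa, 0, 3, [2, 1]))      # 12
```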

4.4 A New Heuristic: Counting Cycles 

Consider, now, the following DFA 




The DM heuristic produces a r.e. with alphabetic size 29 or 26, depending on whether SEA or SEAwn is considered.
The corresponding elimination orders are 1, 4, 0, 2, 3 and 1, 3, 2, 4, respectively. For this DFA the optimal
alphabetic size for a r.e. obtained by the state elimination method is 16 (and the worst is 126). Instead
of the weight of a state being the weighted summation of its in- and out-degrees, one can consider the
number of cycles that pass through it (multiplicities included). In this particular case the obtained r.e.
has size 19. The number of cycles for each state is, by increasing identifier order, 4, 3, 4, 3 and 2,
respectively.

Two strategies can be developed to obtain an elimination ordering: 

(CI) statically determine the number of cycles for each state $q$ of the original automaton (CS); this
can be achieved in $O(n^2)$.

(CII) dynamically determine those values after each elimination step (CD); this can be achieved in 

In the second case, (CII), instead of the multiplicities, the alphabetic size of each transition label is
considered.
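
To make the cycle-based weight concrete, the sketch below counts, for each state, the simple cycles that pass through it, weighting each cycle by the product of the multiplicities of its arcs. It is a brute-force enumeration meant only as an illustration (its running time can be exponential, so it is not the procedure behind the bounds stated above). It also assumes that loops count as cycles and that states are numbered by integers; these choices and the sample digraph are assumptions of this example.

```python
def cycles_through(states, mult):
    """mult maps an arc (u, v) to the number of parallel transitions from u to v."""
    succ = {}
    for (u, v), m in mult.items():
        succ.setdefault(u, []).append((v, m))
    count = {q: 0 for q in states}

    def dfs(start, u, path, weight):
        for v, m in succ.get(u, ()):
            if v == start:
                # closed a simple cycle; credit every state on it
                for q in path:
                    count[q] += weight * m
            elif v > start and v not in path:
                # only visit larger states so each cycle is found exactly once,
                # from its smallest state
                dfs(start, v, path + [v], weight * m)

    for s in states:
        dfs(s, s, [s], 1)
    return count

# Hypothetical digraph: cycle 0-1-0, cycle 1-2-1 with a double arc, loop on 2.
mult = {(0, 1): 1, (1, 0): 1, (1, 2): 2, (2, 1): 1, (2, 2): 1}
print(cycles_through([0, 1, 2], mult))  # {0: 1, 1: 3, 2: 3}
```

A static ordering in the sense of (CI) would then sort the states by these counts, for instance with `sorted(count, key=count.get)`.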

5 Experimental Results 

Each of the state elimination algorithms described before was implemented in Python within the FAdo
system [MR05, AAA+09, FAd10]. The experiments were undertaken with samples of 10,000 uniformly
generated random ICDFAs [AMR07] with a fixed number of states ($n$) and alphabet size ($k$). The sample
size ensures statistical significance with a 95% confidence level within a 1% error margin. Most of
the tests were performed for automata with $n \in \{10,20,50\}$ states and $k \in \{2,3,5,10,26,100\}$ symbols.
Each generated automaton is represented by a canonical string. Assuming an ordering on the alphabet,
the states are numbered from 0 to $n-1$, 0 being the initial state. The string representation is a list of
the states reached from each state, by increasing order of symbols and of state numbering, beginning with the
initial state. For example, the string for the DFA of Section 4.3, considering $a < b$, is 12312312.
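
Reading the transition function back from such a string is a matter of indexing. The sketch below assumes the string is given as a list of state numbers (one digit per state is enough for the example) and that symbols are identified with their rank 0, ..., k-1 in the alphabet ordering; the name `decode` is hypothetical and the set of final states is not part of the string.

```python
def decode(canonical, n, k):
    """Return delta as a dict (state, symbol index) -> target state."""
    if len(canonical) != n * k:
        raise ValueError("a complete DFA needs exactly n * k entries")
    return {(q, a): canonical[q * k + a] for q in range(n) for a in range(k)}

# The example string, with a = 0 and b = 1.
delta = decode([int(c) for c in "12312312"], n=4, k=2)
print(delta[(0, 0)], delta[(0, 1)])  # 1 2  (from state 0: a leads to 1, b leads to 2)
print(delta[(1, 0)], delta[(1, 1)])  # 3 1
```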
Experiments were carried out considering the following goals: 
• to determine the density of occurrence of bridge states in (complete) DFAs;

• to test the performance of SEAwn, i.e. the state elimination method without normalization,
independently of other elimination ordering strategies;

• to test the performance of the strategies based on counting the number of cycles.

5.1 Bridge States Density

The performance of the strategy HW proposed by Han and Wood, and described in Section 4.2, heavily
depends on the existence of bridge states in a finite automaton. We estimated the occurrence of these
states in ICDFAs, and their average position in the ICDFA canonical string. In the string representation,
an early position corresponds to a closer proximity to the initial state. Thus this index measures the state
distance from the initial state and gives information about the number of states of each subautomaton into
which the ICDFA can be decomposed. In the following table, and for each sample, tot is the total number
of bridge states, num is the number of ICDFAs with at least one bridge state and pos is their average position
in the ICDFA canonical string. The table values suggest that bridge states are very rare and that a bridge state
is usually the initial state or adjacent from it. Note that for larger alphabets ($k \ge 10$) no bridge states
at all were found.





         k = 2                 k = 3                k = 5                k = 10
         tot   num   pos       tot   num   pos      tot   num   pos      tot   num   pos
n = 10   3252  2327  0.824     829   707   0.458    88    82    0.193    0     0     N/A
n = 20   3506  2375  1.224     757   634   0.486    73    71    0.123    0     0     N/A
n = 50   3499  2411  1.375     758   649   0.451    69    63    0.115    0     0     N/A



5.2 SEAwn Performance 

To test the performance of the SEAwn method, several elimination ordering strategies were considered.
A trivial order is the one in which the states occur in the ICDFA canonical string. This ordering produces
very bad results (even compared with a random one) but here we wanted to test the effect of the prior
automata normalization. The corresponding algorithms are S and Swn, respectively with and without
normalization. We also considered the DM strategy with the SEAwn method (DMwn). For each pair of
algorithms, the ratio between the average r.e. alphabetic sizes was computed. The following bar charts
summarize some of the results. The Swn method (without normalization) always outperforms the S method
(with normalization). Because the r.e. sizes are huge, some ratios are very small. For example, a ratio of
0.08, for $n = 50$ and $k = 10$, corresponds to a decrease of two orders of magnitude (from $10^{27}$ to $10^{25}$).
The DMwn method can achieve an improvement of 15% over the DM one.

[Figure: bar charts of the ratios Swn/S and DMwn/DM between average r.e. alphabetic sizes, for n = 20 and n = 50, with k = 2, 3, 5, 10, 26, 100 on the horizontal axis.]



5.3 Cycle Heuristic Performance 

The two heuristics presented in Section 4.4, CS and CD, were implemented using the SEAwn method. It
was then natural to compare their performance with DMwn, the best heuristic so far. The following table
summarizes the results. The third to the fifth columns have the average r.e. alphabetic sizes obtained for
each of the mentioned heuristics. The sixth column corresponds to the average of the minimum value of
the three, the best of the 3 (B3). The last three columns contain the maximum values obtained by each of
the heuristics.



k    n    DMwn         CS           CD           B3           MDMwn        MCS          MCD
2    10   149          144          143          135          864          1014         909
     20   1557         1531         1617         1331         12494        18235        16230
     50   3.5 x 10^5   4.9 x 10^5   5.5 x 10^5   2.5 x 10^5   7.8 x 10^6   1.9 x 10^7   2.5 x 10^7
3    10   633          617          628          564          4792         4206         5095
     20   23431        25817        27560        18739        339595       365533       428164
     50   2.5 x 10^8   7.6 x 10^8   6.5 x 10^8   1.6 x 10^8   1.0 x 10^10  1.6 x 10^11  8.9 x 10^10
5    10   4492         4646         4713         3942         32780        34044        35508
     20   1.0 x 10^6   1.5 x 10^6   1.4 x 10^6   8.2 x 10^5   1.2 x 10^7   2.8 x 10^7   2.7 x 10^7
     50   5.5 x 10^12  3.5 x 10^13  2.0 x 10^13  3.2 x 10^12  4.4 x 10^14  5.3 x 10^15  3.1 x 10^15
10   10   52943        59921        57138        47564        232338       430391       262446
     20   1.8 x 10^8   3.1 x 10^8   2.7 x 10^8   1.4 x 10^8   1.7 x 10^9   9.9 x 10^9   3.2 x 10^9
26   10   6.0 x 10^5   7.1 x 10^5   6.5 x 10^5   5.8 x 10^5   1.1 x 10^6   1.7 x 10^6   1.5 x 10^6
     20   3.3 x 10^10  5.7 x 10^10  4.4 x 10^10  2.9 x 10^10  1.3 x 10^11  3.8 x 10^11  1.8 x 10^11
100  10   4.1 x 10^6   4.2 x 10^6   4.1 x 10^6   4.1 x 10^6   5.3 x 10^6   5.6 x 10^6   5.5 x 10^6
     20   1.5 x 10^12  1.7 x 10^12  1.6 x 10^12  1.4 x 10^12  2.1 x 10^12  2.9 x 10^12  2.7 x 10^12



On average, the DMwn heuristic outperforms the other two, although not always. However, the
performance of the cycle heuristics is of the same order of magnitude. The comparison between CS
and CD is hard to interpret. The overhead of reevaluating the cycle weights after each step does not seem
worthwhile. This suggests that the CS strategy is a good choice, even compared with DMwn, as the
weights are computed only once. The most important result is that, considering the three heuristics
together, a better value is always obtained (B3). This means that when DMwn produces a bad value one of
the other two produces a better value, and vice versa. This is surprising, and deserves future research.
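
The B3 column is an average of per-automaton minima, so it can be strictly smaller than the smallest of the three column averages. The sketch below illustrates this with made-up numbers; the data and the function name `summarize` are hypothetical and are not taken from the experiments.

```python
from statistics import mean

def summarize(results):
    """results: one dict per automaton, mapping heuristic name to r.e. alphabetic size."""
    names = results[0].keys()
    averages = {name: mean(r[name] for r in results) for name in names}
    averages["B3"] = mean(min(r.values()) for r in results)   # best of the three, per automaton
    return averages

# Two hypothetical automata on which different heuristics win.
results = [
    {"DMwn": 10, "CS": 14, "CD": 12},
    {"DMwn": 20, "CS": 12, "CD": 16},
]
print(summarize(results))  # {'DMwn': 15, 'CS': 13, 'CD': 14, 'B3': 11}
```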

6 Conclusions

Several state elimination ordering strategies were analysed and new ones were proposed. Experiments
were conducted with statistically accurate samples of uniformly generated random deterministic finite
automata. In this context the following conclusions can be drawn:

• a general improvement in all strategies is obtained if the SEA without normalization is considered;

• bridge states are very rare;

• the HW strategy clearly clashes with the new strategies based on counting the number of cycles (CS
and CD), because bridge states are cycle free; but, as we saw, their rarity makes this contradiction
unimportant;

• the newly proposed strategies (CS and CD) are comparable with the DM heuristic; however these
new heuristics only outperform, on average, the DM heuristic for automata with small alphabets
and a small number of states;

• if one takes as strategy, for each automaton, the best result from these three heuristics (DM, CS
and CD) a gain of 25% is obtained, with the same worst case complexity, $O(n^3)$.

Part of our planned future work is to gain some theoretical understanding of these facts. Furthermore, 
we conjecture that a more sophisticated hybridization of these three heuristics could lead to even better 
results. 